This document aims to provide further examples in how to use the hpgltools.
Note to self, the header has rmarkdown::pdf_document instead of html_document or html_vignette because it gets some bullcrap error ‘margins too large’…
Here are the commands I invoke to get ready to play with new data, including everything required to install hpgltools, the software it uses, and the fission data.
if (! "BiocInstaller" %in% installed.packages()) {
source("http://bioconductor.org/biocLite.R")
}
if (! "devtools" %in% installed.packages()) {
biocLite("devtools")
}
if (! "hpgltools" %in% installed.packages()) {
devtools::install_github("elsayed-lab/hpgltools")
}
if (! "fission" %in% installed.packages()) {
biocLite("fission")
}
## I use the function 'sm()' to quiet loud functions.
tt <- sm(library(fission))
tt <- sm(data(fission))
All the work I do in Dr. El-Sayed’s lab makes some pretty hard assumptions about how data is stored. As a result, to use the fission data set I will do a little bit of shenanigans to match it to the expected format. Now that I have played a little with fission, I think its format is quite nice and am likely to have my experiment class instead be a SummarizedExperiment.
## Extract the meta data from the fission dataset
meta <- as.data.frame(fission@colData)
## Make conditions and batches
meta$condition <- paste(meta$strain, meta$minute, sep=".")
meta$batch <- meta$replicate
meta$sample.id <- rownames(meta)
## Grab the count data
fission_data <- fission@assays$data$counts
## This will make an experiment superclass called 'expt' and it contains
## an ExpressionSet along with any arbitrary additional information one might want to include.
## Along the way it writes a Rdata file which is by default called 'expt.Rdata'
fission_expt <- create_expt(metadata=meta, count_dataframe=fission_data)
## Reading the sample metadata.
## The sample definitions comprises: 36 rows(samples) and 7 columns(metadata fields).
## Matched 7039 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 7039 rows and 36 columns.
There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.
## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
fis_libsize <- plot_libsize(fission_expt)
fis_libsize$plot
## Here we see that the wild type replicate 3 sample for 15 minutes has fewer non-zero genes than all its friends.
fis_nonzero <- plot_nonzero(fission_expt, labels="boring", title="nonzero vs. cpm")
fis_nonzero$plot
fis_density <- plot_density(fission_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fis_density$plot
fis_density$condition_summary
## condition min 1st median mean 3rd max
## 1: wt.0 0.5 31 216 1839 647 5346309
## 2: wt.15 0.5 20 135 1819 474 6733064
## 3: wt.30 0.5 27 183 1917 559 6722419
## 4: wt.60 0.5 27 195 1676 627 5034452
## 5: wt.120 0.5 25 183 1540 529 4531663
## 6: wt.180 0.5 29 210 1904 600 7223993
## 7: mut.0 0.5 35 238 1826 716 4818286
## 8: mut.15 0.5 22 142 1582 461 4387929
## 9: mut.30 0.5 33 226 2028 684 6061779
## 10: mut.60 0.5 39 279 1981 772 5268796
## 11: mut.120 0.5 35 242 1763 675 4174104
## 12: mut.180 0.5 24 178 1534 510 4276682
fis_density$batch_summary
## batch min 1st median mean 3rd max
## 1: r1 0.5 30 205 1908 632 6733064
## 2: r2 0.5 28 193 1727 586 7223993
## 3: r3 0.5 27 196 1717 600 6722419
fis_density$sample_summary
## sample min 1st median mean 3rd max
## 1: GSM1368273 0.5 46.0 306 2225.5 869.0 4542884
## 2: GSM1368274 0.5 21.0 147 1344.9 411.0 4117160
## 3: GSM1368275 0.5 34.0 243 1946.5 678.0 5346309
## 4: GSM1368276 0.5 34.0 226 2625.1 751.5 6733064
## 5: GSM1368277 0.5 20.0 123 1470.8 393.0 4094540
## 6: GSM1368278 0.5 13.0 95 1360.3 312.0 4714248
## 7: GSM1368279 0.5 38.0 257 2321.0 754.5 5624485
## 8: GSM1368280 0.5 19.0 138 1528.6 406.0 4966659
## 9: GSM1368281 0.5 26.0 182 1902.2 517.0 6722419
## 10: GSM1368282 0.5 15.0 106 996.5 301.0 3160098
## 11: GSM1368283 0.5 50.0 357 2451.7 1009.5 5034452
## 12: GSM1368284 0.5 31.0 229 1578.8 636.0 3781322
## 13: GSM1368285 0.5 33.0 238 1784.7 665.0 3855298
## 14: GSM1368286 0.5 23.0 170 1386.8 481.5 3419088
## 15: GSM1368287 0.5 21.0 161 1447.8 452.0 4531663
## 16: GSM1368288 0.5 32.0 224 1853.6 614.5 4562092
## 17: GSM1368289 0.5 35.0 252 2351.9 704.0 7223993
## 18: GSM1368290 0.5 22.0 172 1505.2 473.0 4711077
## 19: GSM1368291 0.5 23.0 154 1325.8 439.0 3317558
## 20: GSM1368292 0.5 39.0 267 1821.3 738.5 4093207
## 21: GSM1368293 0.5 52.0 357 2330.7 1009.5 4818286
## 22: GSM1368294 0.5 19.0 120 1473.7 385.0 3656215
## 23: GSM1368295 0.5 25.5 168 1713.5 521.5 4387929
## 24: GSM1368296 0.5 22.0 147 1557.8 469.0 4337853
## 25: GSM1368297 0.5 37.0 255 2580.2 775.5 6061779
## 26: GSM1368298 0.5 28.0 189 1721.2 580.0 4262642
## 27: GSM1368299 0.5 35.0 239 1781.4 713.0 3280103
## 28: GSM1368300 0.5 41.0 294 2026.0 824.5 3590986
## 29: GSM1368301 0.5 36.0 248 1845.1 674.0 5268796
## 30: GSM1368302 0.5 39.0 302 2072.8 833.5 5229526
## 31: GSM1368303 0.5 36.0 250 1884.7 699.5 3935733
## 32: GSM1368304 0.5 33.0 224 1670.9 612.0 4174104
## 33: GSM1368305 0.5 36.0 260 1733.5 725.0 3387186
## 34: GSM1368306 0.5 29.0 203 1798.8 582.5 4276682
## 35: GSM1368307 0.5 26.0 184 1416.2 513.5 3266544
## 36: GSM1368308 0.5 19.0 153 1385.7 433.0 4232664
## sample min 1st median mean 3rd max
In most cases, raw data does not cluster very well, lets see if that is also true for the fission experiment. Assuming it doesn’t, lets normalize the data using the defaults (cpm, quantile, log2) and try again.
## Something in this is causing a build loop on travis...
## Unsurprisingly, the raw data doesn't cluster well at all...
##fis_rawpca <- plot_pca(fission_expt, expt_names=fission_expt$condition, cis=NULL)
fis_rawpca <- plot_pca(fission_expt)
fis_rawpca$plot
## So, normalize the data
norm_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant", convert="cpm"))
## Error in preprocessCore::normalize.quantiles(as.matrix(count_table)): ERROR; return code from pthread_create() is 22
## And try the pca again
fis_normpca <- plot_pca(norm_expt, plot_labels="normal", title="normalized pca", cis=NULL)
## Error in plot_pca(norm_expt, plot_labels = "normal", title = "normalized pca", : object 'norm_expt' not found
fis_normpca$plot
## Error in eval(expr, envir, enclos): object 'fis_normpca' not found
summary(fis_normpca)
## Error in summary(fis_normpca): object 'fis_normpca' not found
testing <- plot_pca(norm_expt, num_pc=3)
## Error in plot_pca(norm_expt, num_pc = 3): object 'norm_expt' not found
silly <- make_3d_pca(testing)
## Error in make_3d_pca(testing): could not find function "make_3d_pca"
silly$plot
## Error in eval(expr, envir, enclos): object 'silly' not found
In the final line of the preceeding block, I printed a summary of the return from plot_pca(). It contains the following information:
With that in mind, lets perform some more pca plots after normalizing the data and see how different they look.
normbatch_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant",
convert="cpm", batch="sva"))
## Error in preprocessCore::normalize.quantiles(as.matrix(count_table)): ERROR; return code from pthread_create() is 22
fis_normbatchpca <- plot_pca(normbatch_expt,
title="Normalized PCA with batch effect correction.", cis=NULL)
## Error in plot_pca(normbatch_expt, title = "Normalized PCA with batch effect correction.", : object 'normbatch_expt' not found
fis_normbatchpca$plot
## Error in eval(expr, envir, enclos): object 'fis_normbatchpca' not found
## ok, that caused the 0, 60, 15, and 30 minute samples to cluster nicely
## the 120 and 180 minute samples are still a bit tight
## pca_information provides some more information about the call to
## fast.svd that went into making the pca plot
fis_info <- pca_information(norm_expt,
expt_factors=c("condition","batch","strain","minute"),
num_components=6)
## Error in pca_information(norm_expt, expt_factors = c("condition", "batch", : object 'norm_expt' not found
## The r^2 table shows that quite a lot of the variance in the data is explained by condition
head(fis_info$rsquared_table)
## Error in head(fis_info$rsquared_table): object 'fis_info' not found
## We can look at the correlation between the principle components and the factors in the experiment
## in this case looking at condition/batch vs the first 4 components.
fis_info$pca_cor
## Error in eval(expr, envir, enclos): object 'fis_info' not found
## And p-values to lend some credence(or not to those assertions)
fis_info$anova_p
## Error in eval(expr, envir, enclos): object 'fis_info' not found
## Try again with batch removed data
batchnorm_expt <- sm(normalize_expt(fission_expt, batch="limma", norm="quant",
transform="log2", convert="cpm"))
## Error in preprocessCore::normalize.quantiles(as.matrix(count_table)): ERROR; return code from pthread_create() is 22
fis_batchnormpca <- plot_pca(batchnorm_expt, plot_title="limma corrected pca")
## Error in plot_pca(batchnorm_expt, plot_title = "limma corrected pca"): object 'batchnorm_expt' not found
fis_batchnormpca$plot
## Error in eval(expr, envir, enclos): object 'fis_batchnormpca' not found
test_pca <- pca_information(batchnorm_expt,
expt_factors=c("condition","batch","strain","minute"),
num_components=6)
## Error in pca_information(batchnorm_expt, expt_factors = c("condition", : object 'batchnorm_expt' not found
Interesting, the batch normalized pca plot looks much the same as the normalized. The variances are in fact pretty much the exact same…
We have some tools which provide visualizations of the distribution of the data:
fission_boxplot <- sm(plot_boxplot(fission_expt))
fission_boxplot
sf_expt <- sm(normalize_expt(fission_expt, norm="sf"))
fission_boxplot <- sm(plot_boxplot(sf_expt))
fission_boxplot
tm_expt <- sm(normalize_expt(fission_expt, norm="tmm"))
fission_boxplot <- sm(plot_boxplot(tm_expt))
fission_boxplot
rle_expt <- sm(normalize_expt(fission_expt, norm="rle"))
fission_boxplot <- sm(plot_boxplot(rle_expt))
fission_boxplot
up_expt <- sm(normalize_expt(fission_expt, norm="upperquartile"))
fission_boxplot <- sm(plot_boxplot(up_expt))
fission_boxplot
fission_density <- plot_density(norm_expt)
## Error in plot_density(norm_expt): object 'norm_expt' not found
fission_density$plot
## Error in eval(expr, envir, enclos): object 'fission_density' not found
fission_density <- plot_density(sf_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fission_density$plot
fission_density <- plot_density(tm_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fission_density$plot
compare_12 <- plot_single_qq(fission_expt, x=1, y=2)
compare_12$log
Ok, so we can further check out how the data cluster with respect to one another…
fission_cor <- plot_corheat(norm_expt)
## Error in plot_heatmap(expt_data, expt_colors = expt_colors, expt_design = expt_design, : object 'norm_expt' not found
fission_cor$plot
## Error in eval(expr, envir, enclos): object 'fission_cor' not found
fission_cor <- plot_corheat(batchnorm_expt)
## Error in plot_heatmap(expt_data, expt_colors = expt_colors, expt_design = expt_design, : object 'batchnorm_expt' not found
fission_cor$plot
## Error in eval(expr, envir, enclos): object 'fission_cor' not found
fission_dis <- plot_disheat(norm_expt)
## Error in plot_heatmap(expt_data, expt_colors = expt_colors, expt_design = expt_design, : object 'norm_expt' not found
fission_dis$plot
## Error in eval(expr, envir, enclos): object 'fission_dis' not found
fission_dis <- plot_disheat(batchnorm_expt)
## Error in plot_heatmap(expt_data, expt_colors = expt_colors, expt_design = expt_design, : object 'batchnorm_expt' not found
fission_dis$plot
## Error in eval(expr, envir, enclos): object 'fission_dis' not found
variancePartition may be used to seek out which experimental factors correlate with the most variance in the data.
test_varpart <- varpart(fission_expt, predictor=NULL, factors=c("condition","batch"))
## Error in varpart(fission_expt, predictor = NULL, factors = c("condition", : could not find function "varpart"
test_varpart$percent_plot
## Error in eval(expr, envir, enclos): object 'test_varpart' not found
test_varpart$partition_plot
## Error in eval(expr, envir, enclos): object 'test_varpart' not found
## Here, let us test the variance contributed by strain, time, and replicate.
test_varpart <- varpart(fission_expt, predictor=NULL, factors=c("condition", "strain", "minute", "replicate"))
## Error in varpart(fission_expt, predictor = NULL, factors = c("condition", : could not find function "varpart"
test_varpart$percent_plot
## Error in eval(expr, envir, enclos): object 'test_varpart' not found
test_varpart$partition_plot
## Error in eval(expr, envir, enclos): object 'test_varpart' not found
YAY!
pander::pander(sessionInfo())
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: fission(v.1.4.0), hpgltools(v.1.0), SummarizedExperiment(v.1.14.0), DelayedArray(v.0.10.0), BiocParallel(v.1.18.0), matrixStats(v.0.54.0), GenomicRanges(v.1.36.0), GenomeInfoDb(v.1.20.0), IRanges(v.2.18.1), S4Vectors(v.0.22.0), Biobase(v.2.44.0) and BiocGenerics(v.0.30.0)
loaded via a namespace (and not attached): backports(v.1.1.4), fastmatch(v.1.1-0), igraph(v.1.2.4.1), plyr(v.1.8.4), selectr(v.0.4-1), lazyeval(v.0.2.2), splines(v.3.6.1), usethis(v.1.5.1), ggplot2(v.3.2.0), urltools(v.1.7.3), sva(v.3.32.1), digest(v.0.6.20), foreach(v.1.4.4), htmltools(v.0.3.6), GOSemSim(v.2.10.0), viridis(v.0.5.1), GO.db(v.3.8.2), gdata(v.2.18.0), magrittr(v.1.5), memoise(v.1.1.0), doParallel(v.1.0.14), limma(v.3.40.2), remotes(v.2.1.0), Biostrings(v.2.52.0), readr(v.1.3.1), annotate(v.1.62.0), enrichplot(v.1.4.0), prettyunits(v.1.0.2), colorspace(v.1.4-1), blob(v.1.2.0), rvest(v.0.3.4), ggrepel(v.0.8.1), xfun(v.0.8), dplyr(v.0.8.3), jsonlite(v.1.6), callr(v.3.3.0), crayon(v.1.3.4), RCurl(v.1.95-4.12), genefilter(v.1.66.0), lme4(v.1.1-21), zeallot(v.0.1.0), survival(v.2.44-1.1), iterators(v.1.0.10), glue(v.1.3.1), polyclip(v.1.10-0), gtable(v.0.3.0), zlibbioc(v.1.30.0), XVector(v.0.24.0), UpSetR(v.1.4.0), pkgbuild(v.1.0.3), DESeq(v.1.36.0), scales(v.1.0.0), DOSE(v.3.10.2), DBI(v.1.0.0), edgeR(v.3.26.5), Rcpp(v.1.0.1), viridisLite(v.0.3.0), genoPlotR(v.0.8.9), xtable(v.1.8-4), progress(v.1.2.2), gridGraphics(v.0.4-1), europepmc(v.0.3), bit(v.1.1-14), preprocessCore(v.1.46.0), httr(v.1.4.0), fgsea(v.1.10.0), gplots(v.3.0.1.1), RColorBrewer(v.1.1-2), farver(v.1.1.0), pkgconfig(v.2.0.2), XML(v.3.98-1.20), locfit(v.1.5-9.1), ggplotify(v.0.0.3), tidyselect(v.0.2.5), labeling(v.0.3), rlang(v.0.4.0), reshape2(v.1.4.3), AnnotationDbi(v.1.46.0), munsell(v.0.5.0), tools(v.3.6.1), cli(v.1.1.0), RSQLite(v.2.1.1), ade4(v.1.7-13), ggridges(v.0.5.1), devtools(v.2.1.0), evaluate(v.0.14), stringr(v.1.4.0), yaml(v.2.2.0), processx(v.3.4.0), knitr(v.1.23), bit64(v.0.9-7), fs(v.1.3.1), pander(v.0.6.3), caTools(v.1.17.1.2), purrr(v.0.3.2), ggraph(v.1.0.2), nlme(v.3.1-140), DO.db(v.2.9), xml2(v.1.2.0), biomaRt(v.2.40.1), compiler(v.3.6.1), pbkrtest(v.0.4-7), rstudioapi(v.0.10), curl(v.3.3), variancePartition(v.1.14.0), testthat(v.2.1.1), geneplotter(v.1.62.0), tweenr(v.1.0.1), tibble(v.2.1.3), stringi(v.1.4.3), highr(v.0.8), ps(v.1.3.0), GenomicFeatures(v.1.36.3), desc(v.1.2.0), lattice(v.0.20-38), Matrix(v.1.2-17), nloptr(v.1.2.1), vctrs(v.0.2.0), pillar(v.1.4.2), triebeard(v.0.3.0), cowplot(v.0.9.4), data.table(v.1.12.2), bitops(v.1.0-6), corpcor(v.1.6.9), qvalue(v.2.16.0), rtracklayer(v.1.44.0), colorRamps(v.2.3), R6(v.2.4.0), directlabels(v.2018.05.22), gridExtra(v.2.3), KernSmooth(v.2.23-15), sessioninfo(v.1.1.1), codetools(v.0.2-16), boot(v.1.3-23), MASS(v.7.3-51.4), gtools(v.3.8.1), assertthat(v.0.2.1), pkgload(v.1.0.2), rprojroot(v.1.3-2), withr(v.2.1.2), GenomicAlignments(v.1.20.1), Rsamtools(v.2.0.0), GenomeInfoDbData(v.1.2.1), mgcv(v.1.8-28), hms(v.0.5.0), clusterProfiler(v.3.12.0), quadprog(v.1.5-7), grid(v.3.6.1), tidyr(v.0.8.3), minqa(v.1.2.4), rvcheck(v.0.1.3), rmarkdown(v.1.13), Rtsne(v.0.15), ggforce(v.0.2.2) and base64enc(v.0.1-3)